(Amazon_mens_clothing)

by (Mostafa khalil)

Table of Contents

Introduction

Clothing is among the best-selling categories on the Internet. After scraping Amazon's men's clothing pages, I collected the data into a CSV file. This dataset covers men's clothing from 300 pages on the Amazon website, with 48 results per page. It contains over 14,000 rows with 6 columns.

to get Amazon_mens clothing page

I will analyze the data to answer the questions below:

  • What is the highest-priced product, and what are the 10 highest prices?
  • What is the lowest-priced product, and what are the 10 lowest prices?
  • What is the average price in the men's clothing category?
  • What are the most frequent products?
  • How many products have a five-star rating?
  • How many products have a rating less than 2?
  • Which product has the highest no_review, and what are the top 10?
  • How many products have no_review > 1161?
  • Which products have no_review > and price < $400?
  • Which products contain "jeans" in their name?
  • Which rating has the highest no_review?
  • What is the best product, with a high rating and a high number of reviews?

What is the structure of your dataset?

This dataset contains 14,352 product rows with 6 columns (product, rating, no_review, price, image, product_url):

  • product : the name of the product on Amazon
  • rating : star rating from 1 to 5, where 5 is the best
  • no_review : the number of people who rated the product
  • price : the price of each product
  • image : URL of the product image
  • product_url : URL of each product's page on Amazon

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is price: which products are expensive and which are cheap.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

The rating and no_review columns will help me investigate price and understand the relationship between price and these two columns.

Import libraries

Set options to display all columns and rows
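The code cell for this step is not shown here. A minimal sketch of how the display options could be set with pandas, assuming pandas is the library used for the analysis:

```python
import pandas as pd

# Show every column and every row when a DataFrame is printed,
# instead of the truncated default view.
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
```

Printing all 14,352 rows can be slow in a notebook, so `display.max_rows` is often set to a finite number instead.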

Data Wrangling

1-Gathering the data
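A sketch of the loading step. The real CSV file name is not given in this report, so the example reads two made-up rows from an in-memory string; only the six column names come from the dataset description.

```python
import io
import pandas as pd

# Stand-in for the scraped file. In the notebook this would be a call such as
# pd.read_csv("amazon_mens_clothing.csv"), where the file name is hypothetical.
csv_text = io.StringIO(
    "product,rating,no_review,price,image,product_url\n"
    "Toy Shirt A,4.5,120,19.99,https://example.com/a.jpg,https://example.com/a\n"
    "Toy Jeans B,3.9,40,35.00,https://example.com/b.jpg,https://example.com/b\n"
)
df = pd.read_csv(csv_text)

print(df.shape)   # (rows, columns)
print(df.dtypes)  # column types inferred by pandas
```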

2-Assessment

Check for null values

Check for duplicates
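Both checks can be sketched with standard pandas calls. The rows below are toy data made up for illustration, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: one exact duplicate row and two missing values.
df = pd.DataFrame({
    "product": ["Toy Shirt A", "Toy Shirt A", "Toy Jeans B", "Toy Socks C"],
    "rating": [4.5, 4.5, np.nan, 3.0],
    "no_review": [120, 120, 40, np.nan],
    "price": [19.99, 19.99, 35.0, 5.0],
})

# Number of missing values in each column.
null_counts = df.isnull().sum()

# Number of rows that exactly repeat an earlier row.
n_duplicates = df.duplicated().sum()

print(null_counts)
print(n_duplicates)
```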

3-Cleaning the Data

Make a copy
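A small sketch of why the copy matters: changes made to the copy during cleaning leave the original frame untouched. The name `df_clean` and the toy row are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"product": ["Toy Shirt A"], "price": [19.99]})

# DataFrame.copy() makes a deep copy by default, so the cleaning steps
# can modify df_clean without altering the raw data in df.
df_clean = df.copy()
df_clean.loc[0, "price"] = 0.0

print(df.loc[0, "price"])        # original value is unchanged
print(df_clean.loc[0, "price"])  # modified value
```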

Define

Code

Test

Define

Code

Test

Exploratory Data Analysis

Load the dataset and describe its properties through the questions.

  • What is the highest-priced product, and what are the 10 highest prices?
  • What is the lowest-priced product, and what are the 10 lowest prices?
  • What is the average price in the men's clothing category?
  • What are the most frequent products?
  • How many products have a five-star rating?
  • How many products have a rating less than 2?
  • Which product has the highest no_review, and what are the top 10?
  • How many products have no_review > 116?
  • Which products have no_review > 500 and price < $400?
  • Which products contain "jeans" in their name?
  • Which rating has the highest no_review?
  • What is the best product, with a high rating and a high number of reviews?

Research question 1: What is the highest-priced product, and what are the 10 highest prices?

The highest-priced product is the filson lined wool packet coat, at $795.00.
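One way to get the top 10 prices is `DataFrame.nlargest`. This is a sketch on made-up rows, not necessarily the query used in the notebook.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "product": ["Toy Coat", "Toy Shirt", "Toy Jeans", "Toy Socks"],
    "price": [250.0, 19.99, 35.0, 5.0],
})

# The 10 most expensive rows, sorted from highest to lowest price.
# With fewer than 10 rows, nlargest simply returns all of them.
top_prices = df.nlargest(10, "price")[["product", "price"]]
most_expensive = top_prices.iloc[0]

print(top_prices)
```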

Research question 2: What is the lowest-priced product, and what are the 10 lowest prices?

The lowest-priced products are the Hanes Men's Short-Sleeve Beefy T-Shirt with Pocket and the Hanes Beefy-T Mens Pocket T-Shirt, both at $4.80.
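Because two products share the lowest price, a tie-aware query is useful. A sketch with `nsmallest` and `keep="all"`, on toy rows made up for illustration:

```python
import pandas as pd

# Toy data: two products tie for the lowest price.
df = pd.DataFrame({
    "product": ["Toy Tee A", "Toy Tee B", "Toy Jeans", "Toy Coat"],
    "price": [4.8, 4.8, 35.0, 250.0],
})

# keep="all" returns every row tied at the minimum, not just the first one.
cheapest = df.nsmallest(1, "price", keep="all")

# The 10 lowest prices, sorted from lowest to highest.
lowest_prices = df.nsmallest(10, "price")

print(cheapest)
```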

Research question 3: What is the average price in the men's clothing category?
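The average is the mean of the price column. A sketch on toy prices, which are not taken from the dataset:

```python
import pandas as pd

# Toy prices for illustration only.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0]})

# Mean of the price column; NaN values are skipped by default.
avg_price = df["price"].mean()

print(round(avg_price, 2))
```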

Research question 4: What are the most common products?

The most common product is 'Anchor MSJ Men's 50s Male Clothing Rockabilly Style Cotton Mens Shirts Short Sleeve Fifties Bowling Casual Button-Down Shirts', which appears 11 times, followed by 'Funny Guy Mugs Men's Hawaiian Print Button Down Short Sleeve Shirts', which appears 8 times.
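Counting repeated product names can be sketched with `value_counts`, which sorts from most to least frequent. The names below are made up.

```python
import pandas as pd

# Toy product names with repeats.
df = pd.DataFrame({
    "product": [
        "Toy Shirt A", "Toy Jeans B", "Toy Shirt A",
        "Toy Socks C", "Toy Shirt A", "Toy Jeans B",
    ]
})

# Number of rows per product name, most frequent first.
counts = df["product"].value_counts()

print(counts.head(2))
```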

Research question 5: How many products have a five-star rating?

There are 421 products with a 5-star rating.

Research question 5: How many products have a rating less than 2?

There are 11 products with a rating less than 2.
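Both counts come from boolean masks on the rating column. A sketch on toy ratings, not values from the dataset:

```python
import pandas as pd

# Toy ratings for illustration only.
df = pd.DataFrame({"rating": [5.0, 5.0, 4.6, 1.5, 3.0]})

# Summing a boolean mask counts the rows where the condition is True.
five_star = (df["rating"] == 5).sum()
below_two = (df["rating"] < 2).sum()

print(five_star, below_two)
```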

Which product has the highest no_review, and what are the top 10?

Looking at the top 10 products by no_review, Gildan Men's Crew T-Shirts, Multipack has the highest number of reviews, with 126806, as shown in the bottom plot.
More generally, among the top 10 products by number of reviews, the distribution is concentrated between 60000 and 80000 reviews and then tapers off up to the maximum of 126806, as shown in the upper plot. In other words, most of the top 10 products have between 60000 and 80000 reviews.

Research question 6: What is the mean value of the no_review column?

Research question 7: How many products have no_review > 1161?

Research question 8: Find the instances where products have no_review > 1161 and price > $42. How many are there?

There are 253 products with no_review > 1161 and price > $42.
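The two filters can be sketched by combining boolean masks with `&`. The thresholds 1161 and 42 come from the questions above; the rows are toy data.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "product": ["Toy A", "Toy B", "Toy C", "Toy D", "Toy E"],
    "no_review": [5000, 2000, 1161, 300, 1500],
    "price": [60.0, 20.0, 80.0, 90.0, 42.0],
})

# Products with more than 1161 reviews.
many_reviews = df[df["no_review"] > 1161]

# Products that satisfy both conditions; each mask needs its own parentheses.
both = df[(df["no_review"] > 1161) & (df["price"] > 42)]

print(len(many_reviews), len(both))
```

Both comparisons are strict, so a row with exactly 1161 reviews or a price of exactly 42 is excluded.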

Research question 9: Which products contain "jeans" in their name?

There are 246 products that have "jeans" in their name.
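A sketch of the name search with `Series.str.contains`. Matching case-insensitively is an assumption here; the notebook may have matched a specific capitalization. The names are made up.

```python
import pandas as pd

# Toy product names, including a missing value.
df = pd.DataFrame({
    "product": ["Toy Slim Jeans", "Toy JEANS Jacket", "Toy Shirt", None]
})

# case=False ignores capitalization; na=False treats missing names as no match.
jeans = df[df["product"].str.contains("jeans", case=False, na=False)]

print(len(jeans))
```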

Research question 10: Which rating has the highest no_review?

The highest number of reviews, 126806.0, occurs at a rating of 4.6.
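One reading of this question is to find the single row with the largest no_review and look at its rating. The sketch below follows that reading on toy data; grouping by rating and aggregating would be a different interpretation.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "rating": [4.6, 5.0, 3.2],
    "no_review": [900, 12, 300],
})

# idxmax returns the index label of the row with the largest no_review.
top_row = df.loc[df["no_review"].idxmax()]

print(top_row["rating"], top_row["no_review"])
```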

Research question 11: What is the best product, with a high rating and a high number of reviews?

We conclude that the best products are THE COMFY Original | Oversized Microfiber, then Fruit of the Loom Men's Coolzone Boxer Briefs Assorted Colors, then Hanes mens Max Cushion Double Tough Crew Socks, 12-pair Pack.
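How "best" was computed is not shown in this report. One possible sketch is to keep only highly rated products and rank them by number of reviews; the 4.5 cutoff and the rows below are assumptions for illustration.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "product": ["Toy A", "Toy B", "Toy C", "Toy D"],
    "rating": [4.8, 5.0, 4.6, 3.0],
    "no_review": [5000, 3, 9000, 20000],
})

MIN_RATING = 4.5  # hypothetical cutoff for a "high" rating

# Keep highly rated products, then order them by review count.
best = (
    df[df["rating"] >= MIN_RATING]
    .sort_values("no_review", ascending=False)
    .reset_index(drop=True)
)

print(best)
```

Sorting by rating alone would put a 5-star product with only 3 reviews first, which is why the review count is used for the ranking in this sketch.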

At this point, we have finished answering the questions we asked at the beginning.

Now let's analyze the data more generally using visualizations.

Visualizations

Univariate Exploration

In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.

Let's start with the price column.

This scatterplot shows the relationship between rating (x-axis) and price (y-axis). Products with a rating higher than 3 are the most expensive, and price tends to rise with rating, which is expected.

This scatterplot shows the relationship between price (x-axis) and number of reviews (y-axis). The number of reviews is higher for products priced under $150, and the lower the price, the greater the number of reviews. It peaks for products priced under $50; note that the average price is $43.

This line plot shows the relationship between rating (x-axis) and number of reviews (y-axis). Products with a rating greater than 4 and less than 5 have the highest number of reviews.

Conclusions